Problem Statement

Neural Style Transfer

Underlying Principle

The principle is simple: we define two distances, one for the content ($D_C$) and one for the style ($D_S$). $D_C$ measures how different the content is between two images while $D_S$ measures how different the style is between two images. Then, we take a third image, the input, and transform it to minimize both its content-distance with the content-image and its style-distance with the style-image. Now we can import the necessary packages and begin the neural transfer.

Brief Summary of the Original Paper

1. Content representation

Let $\overrightarrow{p}$ and $\overrightarrow{x}$ be the original image and the image that is generated, and $P^l$ and $F^l$ their respective feature representations in layer $l$. We then define the squared-error loss between the two feature representations:

$$ L_{content}(\overrightarrow{p},\overrightarrow{x},l) = \frac{1}{2}\sum_{i,j}(F^l_{ij} - P^l_{ij})^2 $$
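The content loss above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the notebook's own loss module: the function name `content_loss` and the tiny feature matrices are hypothetical, and the features are assumed to be already flattened to shape (filters, positions).

```python
import torch


def content_loss(generated_features: torch.Tensor,
                 content_features: torch.Tensor) -> torch.Tensor:
    """Squared-error content loss: 0.5 * sum_ij (F_ij - P_ij)^2."""
    # The target is a fixed reference, so it is detached from the graph:
    # gradients should flow only into the generated image.
    target = content_features.detach()
    return 0.5 * torch.sum((generated_features - target) ** 2)


# Hypothetical feature matrices for one layer (2 filters, 2 positions).
P = torch.tensor([[1.0, 2.0], [3.0, 4.0]])      # features of the content image
F_l = torch.tensor([[1.0, 0.0], [3.0, 2.0]])    # features of the generated image
loss = content_loss(F_l, P)
```

Here two entries differ by 2, so the sum of squares is 8 and the loss is 4.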

2. Style representation

To obtain a representation of the style of an input image, we use a feature space designed to capture texture information. This feature space can be built on top of the filter responses in any layer of the network. It consists of the correlations between the different filter responses, where the expectation is taken over the spatial extent of the feature maps. These feature correlations are given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$:

$$ G^l_{ij} = \sum_{k}F^l_{ik}F^l_{jk} $$

Let $\overrightarrow{a}$ and $\overrightarrow{x}$ be the original image and the image that is generated, and $A^l$ and $G^l$ their respective style representations in layer $l$. With $N_l$ feature maps of size $M_l$ (height times width) in layer $l$, the contribution of layer $l$ to the total loss is then

$$ E_l = \frac{1}{4N_l^2M_l^2}\sum_{i,j}(G^l_{ij} - A^l_{ij})^2 $$

and the total style loss, where $\omega_l$ are the weighting factors of each layer's contribution, is

$$ L_{style}(\overrightarrow{a},\overrightarrow{x}) = \sum_{l=0}^L\omega_lE_l $$
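The Gram matrix and the per-layer style term can be sketched as follows. This is a minimal illustration under assumptions: the helper names (`gram_matrix`, `layer_style_loss`, `style_loss`) are hypothetical, and each layer's features are assumed to be flattened to shape $(N_l, M_l)$.

```python
import torch


def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """G_ij = sum_k F_ik F_jk for features of shape (N_l, M_l)."""
    return features @ features.t()


def layer_style_loss(generated: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
    """E_l = sum_ij (G_ij - A_ij)^2 / (4 * N_l^2 * M_l^2)."""
    n_l, m_l = generated.shape
    G = gram_matrix(generated)
    A = gram_matrix(style).detach()  # the style target is a constant
    return torch.sum((G - A) ** 2) / (4.0 * n_l ** 2 * m_l ** 2)


def style_loss(generated_layers, style_layers, weights):
    """Weighted sum of the per-layer terms E_l."""
    return sum(w * layer_style_loss(g, a)
               for g, a, w in zip(generated_layers, style_layers, weights))


# Hypothetical features: 2 filters, 2 spatial positions.
gen = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
sty = torch.tensor([[1.0, 1.0], [0.0, 0.0]])
e_l = layer_style_loss(gen, sty)
total = style_loss([gen, gen], [sty, sty], [0.5, 0.5])
```

In this toy case the Gram matrices are the identity and `[[2, 0], [0, 0]]`, the squared differences sum to 2, and the normaliser is $4 \cdot 2^2 \cdot 2^2 = 64$, so `e_l` is 0.03125.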

3. Style transfer

To transfer the style of an artwork $\overrightarrow{a}$ onto a photograph $\overrightarrow{p}$ we synthesise a new image that simultaneously matches the content representation of $\overrightarrow{p}$ and the style representation of $\overrightarrow{a}$. Thus we jointly minimise the distance of the feature representations of a white noise image from the content representation of the photograph in one layer and the style representation of the painting defined on a number of layers of the Convolutional Neural Network. The loss function we minimise is

$$ L_{total}(\overrightarrow{p},\overrightarrow{a},\overrightarrow{x}) = \alpha L_{content}(\overrightarrow{p},\overrightarrow{x}) + \beta L_{style}(\overrightarrow{a},\overrightarrow{x})$$

Importing Packages and Selecting a Device

Loading the Images

Now we will import the style and content images. The original PIL images have values between 0 and 255, but when transformed into torch tensors, their values are converted to be between 0 and 1. The images also need to be resized to have the same dimensions. An important detail to note is that neural networks from the torch library are trained with tensor values ranging from 0 to 1. If you try to feed the networks with 0 to 255 tensor images, then the activated feature maps will be unable to sense the intended content and style. However, pre-trained networks from the Caffe library are trained with 0 to 255 tensor images.
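The rescaling and resizing described above can be sketched without any image files. In this illustration a random `uint8` tensor stands in for the pixels of a PIL image, and the helper name `to_unit_tensor` is hypothetical; with real images, `torchvision.transforms.ToTensor` performs the same 0-255 to 0-1 conversion.

```python
import torch
import torch.nn.functional as F


def to_unit_tensor(pixels: torch.Tensor, size: int) -> torch.Tensor:
    """Turn an (H, W, C) uint8 image into a (1, C, size, size) float tensor in [0, 1]."""
    image = pixels.permute(2, 0, 1).float() / 255.0  # (C, H, W), values in [0, 1]
    image = image.unsqueeze(0)                        # add the batch dimension
    # Bilinear resizing averages neighbouring pixels, so values stay in [0, 1].
    return F.interpolate(image, size=(size, size), mode="bilinear",
                         align_corners=False)


torch.manual_seed(0)
# Synthetic stand-ins for two images of different sizes.
content_pixels = torch.randint(0, 256, (48, 64, 3), dtype=torch.uint8)
style_pixels = torch.randint(0, 256, (80, 40, 3), dtype=torch.uint8)

content_img = to_unit_tensor(content_pixels, 32)
style_img = to_unit_tensor(style_pixels, 32)
```

After this step both tensors share the same shape, which is what the loss computations later rely on.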

Now, let's create a function that displays an image by reconverting a copy of it to PIL format and displaying the copy using plt.imshow. We will try displaying the content and style images to ensure they were imported correctly.

Loss Functions

Importing the Model

Now we need to import a pre-trained neural network. We will use a 19-layer VGG network like the one used in the paper.

PyTorch’s implementation of VGG is a module divided into two child Sequential modules: features (containing convolution and pooling layers), and classifier (containing fully connected layers). We will use the features module because we need the output of the individual convolution layers to measure content and style loss. Some layers have different behavior during training than evaluation, so we must set the network to evaluation mode using .eval().

Additionally, VGG networks are trained on images with each channel normalized by mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]. We will use them to normalize the image before sending it into the network.
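One way to apply this normalization is a small module that can sit at the front of an `nn.Sequential`. The sketch below is an assumption about how such a module might look; the class name `Normalization` is illustrative, and only the mean and std values come from the text.

```python
import torch
import torch.nn as nn


class Normalization(nn.Module):
    """Per-channel normalization: (img - mean) / std."""

    def __init__(self, mean, std):
        super().__init__()
        # Shape (C, 1, 1) broadcasts against image batches of shape (B, C, H, W).
        # Buffers move with the module when .to(device) is called.
        self.register_buffer("mean", torch.tensor(mean).view(-1, 1, 1))
        self.register_buffer("std", torch.tensor(std).view(-1, 1, 1))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return (img - self.mean) / self.std


normalize = Normalization([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
gray = torch.full((1, 3, 4, 4), 0.5)   # a uniform mid-gray test image
out = normalize(gray)
```

For the mid-gray image, every value in the first channel becomes $(0.5 - 0.485) / 0.229$.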

A Sequential module contains an ordered list of child modules. For instance, vgg19.features contains a sequence (Conv2d, ReLU, MaxPool2d, Conv2d, ReLU…) aligned in the right order of depth. We need to add our content loss and style loss layers immediately after the convolution layer they are detecting. To do this we must create a new Sequential module that has content loss and style loss modules correctly inserted.
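The insertion step can be illustrated on a small stand-in network instead of the real VGG-19, which avoids downloading weights. Everything named here is hypothetical: `FeatureTap` is a pass-through layer that merely records its input, standing in for the content loss and style loss modules, and `insert_taps` is an illustrative helper.

```python
import torch
import torch.nn as nn


class FeatureTap(nn.Module):
    """Transparent layer: stores its input and returns it unchanged."""

    def __init__(self):
        super().__init__()
        self.captured = None

    def forward(self, x):
        self.captured = x
        return x


def insert_taps(cnn: nn.Sequential, tap_after: set):
    """Copy `cnn` layer by layer, adding a tap after each layer named in `tap_after`."""
    model = nn.Sequential()
    taps = {}
    conv_index = 0
    for layer in cnn.children():
        if isinstance(layer, nn.Conv2d):
            conv_index += 1
            name = f"conv_{conv_index}"
        elif isinstance(layer, nn.ReLU):
            name = f"relu_{conv_index}"
            # An in-place ReLU would overwrite the tensor the tap just captured.
            layer = nn.ReLU(inplace=False)
        elif isinstance(layer, nn.MaxPool2d):
            name = f"pool_{conv_index}"
        else:
            name = f"layer_{len(model)}"
        model.add_module(name, layer)
        if name in tap_after:
            taps[name] = FeatureTap()
            model.add_module(f"tap_{conv_index}", taps[name])
    return model, taps


torch.manual_seed(0)
# Toy stand-in for vgg19.features with the same kinds of layers.
toy_cnn = nn.Sequential(
    nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(inplace=True),
).eval()

model, taps = insert_taps(toy_cnn, {"conv_1", "conv_2"})
_ = model(torch.rand(1, 3, 8, 8))
layer_names = [name for name, _ in model.named_children()]
```

After one forward pass each tap holds the output of the convolution it follows, which is the tensor a loss module would compare against its target.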

Next, we select the input image. You can use a copy of the content image or white noise.
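Both starting points can be written in a line each. In this sketch `content_img` is a random stand-in for the loaded content image, and the variable names are assumptions.

```python
import torch

torch.manual_seed(0)
content_img = torch.rand(1, 3, 16, 16)  # hypothetical stand-in for the content image

# Option 1: start from a copy of the content image.
# clone() allocates new memory, so optimizing the copy leaves the original intact.
input_img = content_img.clone()

# Option 2: start from white (Gaussian) noise of the same shape.
# Note that randn values are not confined to the [0, 1] image range.
noise_img = torch.randn(content_img.size())

# The input image is what gets optimized, so it must track gradients.
input_img.requires_grad_(True)
```

Starting from the content image usually means the optimizer only has to add style, while starting from noise requires it to rebuild the content as well.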

Gradient Descent

As Leon Gatys, the author of the algorithm, suggested in this forum thread (https://discuss.pytorch.org/t/pytorch-tutorial-for-neural-transfert-of-artistic-style/336/20?u=alexis-jacq), we will use the L-BFGS algorithm to run our gradient descent. Unlike training a network, we want to train the input image in order to minimise the content/style losses. We will create a PyTorch L-BFGS optimizer, optim.LBFGS, and pass our image to it as the tensor to optimize.

Finally, we must define a function that performs the neural transfer. At each iteration, the network is fed the updated input and computes new losses. We will run the backward methods of each loss module to dynamically compute their gradients. The optimizer requires a “closure” function, which reevaluates the model and returns the loss.

We still have one final constraint to address. The network may try to optimize the input with values that exceed the 0 to 1 tensor range for the image. We can address this by clamping the input values to be between 0 and 1 each time the network is run.
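The closure and the clamping step fit together as in the sketch below. To stay self-contained it uses a plain quadratic objective as a stand-in for the weighted content and style losses; the `target` tensor and the number of steps are assumptions, not values from the notebook.

```python
import torch
import torch.optim as optim

torch.manual_seed(0)
target = torch.rand(1, 3, 8, 8)  # hypothetical stand-in for "the image we want"
input_img = torch.rand(1, 3, 8, 8).requires_grad_(True)

# The image itself is the parameter being optimized.
optimizer = optim.LBFGS([input_img])
initial_loss = torch.sum((input_img.detach() - target) ** 2).item()


def closure():
    # Keep the image in the valid range before every evaluation.
    with torch.no_grad():
        input_img.clamp_(0, 1)
    optimizer.zero_grad()
    # Stand-in objective; the real one would be alpha * content + beta * style.
    loss = torch.sum((input_img - target) ** 2)
    loss.backward()
    return loss


for _ in range(5):
    # L-BFGS may call the closure several times within one step.
    optimizer.step(closure)

# One last clamp, since the final update happens after the last closure call.
with torch.no_grad():
    input_img.clamp_(0, 1)
final_loss = torch.sum((input_img.detach() - target) ** 2).item()
```

Clamping inside the closure, under `torch.no_grad()`, keeps the correction out of the autograd graph while still ensuring the network never sees out-of-range pixels.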

Finally, we can run the algorithm.

Universal Style Transfer via Feature Transforms

Let us implement the concept presented in this paper, building on the original implementation.

Main Idea

We construct an auto-encoder network for general image reconstruction. We employ the VGG-19 as the encoder, fix it and train a decoder network simply for inverting VGG features to the original image. The decoder is designed as being symmetrical to that of VGG-19 network (up to Relu_X_1 layer), with the nearest neighbor upsampling layer used for enlarging feature maps. To evaluate with features extracted at different layers, we select feature maps at five layers of the VGG-19, i.e., Relu_X_1 (X=1,2,3,4,5), and train five decoders accordingly. The pixel reconstruction loss and feature loss are employed for reconstructing an input image.

WCT function, taken from the original implementation and extended
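For orientation, the whitening and coloring transform (WCT) can be sketched as below. This is not the original implementation: it is a minimal version that assumes features flattened to shape (channels, positions), uses `torch.linalg.eigh` with a small `eps` added to the covariance for numerical stability, and takes a hypothetical blending weight `alpha`.

```python
import torch


def wct(content_feat: torch.Tensor, style_feat: torch.Tensor,
        alpha: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """Whiten content features, then color them with the style covariance."""
    c_mean = content_feat.mean(dim=1, keepdim=True)
    s_mean = style_feat.mean(dim=1, keepdim=True)
    fc = content_feat - c_mean
    fs = style_feat - s_mean

    eye = torch.eye(fc.size(0), dtype=fc.dtype, device=fc.device)
    cov_c = fc @ fc.t() / (fc.size(1) - 1) + eps * eye  # eps is an assumption
    cov_s = fs @ fs.t() / (fs.size(1) - 1) + eps * eye

    # Eigendecomposition of the symmetric covariance matrices.
    d_c, e_c = torch.linalg.eigh(cov_c)
    d_s, e_s = torch.linalg.eigh(cov_s)

    # Whitening: remove the content correlations (covariance becomes ~identity).
    whitened = e_c @ torch.diag(d_c.rsqrt()) @ e_c.t() @ fc
    # Coloring: impose the style correlations, then restore the style mean.
    colored = e_s @ torch.diag(d_s.sqrt()) @ e_s.t() @ whitened + s_mean

    # Blend between the stylized and the original content features.
    return alpha * colored + (1 - alpha) * content_feat


torch.manual_seed(0)
# Hypothetical feature matrices: 4 channels, different numbers of positions.
content = torch.rand(4, 500, dtype=torch.float64)
style = torch.rand(4, 300, dtype=torch.float64) * 2.0 + 1.0
stylized = wct(content, style)
```

With `alpha=1.0` the output keeps the spatial layout of the content features but has, up to the `eps` regularization, the channel mean and covariance of the style features; a decoder would then map it back to an image.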

Comparing the Results

Conclusion

We see that Universal Style Transfer via Feature Transforms transfers the style while preserving the semantics of the objects, but with some distortions, which can be explained by the decoder layers having been pre-trained on an external dataset. At the same time, this implementation runs faster than the plain modified StyleLoss approach, although the StyleLoss implementation is also simple enough to handle an unlimited number of styles.